Copy, please, these files and directories to your personal directory:

cp -r /data/shared/AGE2020/Exercises/E03-qPCR ~/AGE2020/Exercises

And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E03-qPCR")


Introduction

Quantitative real-time PCR monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold S_T and measure crossing points C_P (also called threshold cycle C_T) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating \Delta C_t = C_P^t - C_P^r. Obtained \Delta C_t values are then used to calculate the relative change of gene expression between samples. In this case, we are interested how much more or less is gene expressed in \text{treatment} sample (\text{trt}), relative to \text{control} (\text{cnt}) sample. This relative change is called \text{fold-change}: FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}. We usually report FC in log_2 scale: LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC) Important is to realize that the higher C_P is, the lower initial mRNA concentration was. This is why we put minus /-/ in front of \Delta\Delta C_t. We usually don’t consider the efficiency of PCR and expect the mRNA concentration to double in each cycle. Otherwise FC = (1+E)^{-\Delta\Delta C_P}, where E is the PCR efficiency.

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold \(S_T\) and measure crossing points \(C_P\) (also called threshold cycle \(C_T\)) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating \(\Delta C_t = C_P^t - C_P^r\). Obtained \(\Delta C_t\) values are then used to calculate the relative change of gene expression between samples. In this case, we are interested how much more or less is gene expressed in \(\text{treatment}\) sample (\(\text{trt}\)), relative to \(\text{control}\) (\(\text{cnt}\)) sample. This relative change is called \(\text{fold-change}\): \(FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}\). We usually report \(FC\) in \(log_2\) scale: \(LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)\) Important is to realize that the higher \(C_P\) is, the lower initial mRNA concentration was. This is why we put minus /\(-\)/ in front of \(\Delta\Delta C_t\). We usually don’t consider the efficiency of PCR and expect the mRNA concentration to double in each cycle. Otherwise \(FC = (1+E)^{-\Delta\Delta C_P}\), where \(E\) is the PCR efficiency.

This will be an introductory exercise in which you will refresh your R skills and implement (mainly visualization) functions, which you will use later on different types of gene expression data.

Our experiment

We will work with data from pancreatic tumour experiment, where an influence of spirulina algae was investigated. In this experiment there are two sample groups:

  • control: samples from tumour.
  • spirulina: samples from tumour treated by extract from spirulina.

Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh…

But we can expand it to:

- control group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample C1_C50_R1
  |   |   |-- technical replicate #2 -> sample C2_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample C1_C90_R1
  |       |-- technical replicate #2 -> sample C1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample C2_C50_R1
      |   |-- technical replicate #2 -> sample C2_C50_R2
      |-- confluence 90
          |-- ...

- spirulina group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample S1_C50_R1
  |   |   |-- technical replicate #2 -> sample S1_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample S1_C90_R1
  |       |-- technical replicate #2 -> sample S1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample S2_C50_R1
      |   |-- technical replicate #2 -> sample S2_C50_R2
      |-- confluence 90
          |-- ...

If you want, we can draw it on the board :)

Also, there are two groups of measured genes:

  1. Target genes - interesting genes found by microarray exploratory analysis.
  2. Housekeeping genes - used as reference genes.

Libraries

library(dplyr)
library(stringr)
library(tidyr)
library(glue)
library(magrittr)

This will be your personal library in which you implement the tasks. Now it contains only function skeletons. These skeletons are not obligatory, but your functions should have output similar to what you will see in this tutorial.

Note that your functions will be universal, serving not only for qPCR data. This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.), where input (expression matrix) is same also for microarrays and RNA-seq data.

source("../age_library.R")

Config

I would recommend you to always define constants at the beginning of a script. It is more transparent, and if you, for example, change some path, you haven’t to replace it in whole source code. Constants are, by convention, named in UPPERCASE.

CP_DATA <- "qPCR.Rds"

Data preprocessing

\(C_P\) matrix

We have already preprocessed a \(C_P\) matrix for you. But if you really want to start from the floor, you can create the \(C_P\) matrix from original files (here and here).

In \(C_P\) matrix, columns are samples, rows are genes, and values are \(C_P\) values. This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-Seq.

cp <- readRDS(CP_DATA) %>%
  as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)

cp[1:5, 1:5]

CP values >= 40 are considered out of qPCR limit:

cp[cp > 39.9] <- NA

Some genes have so many NA values:

lq_genes <- rowSums(!is.na(cp))
lq_genes
##     CD24     CD31   CXCL12    CXCR4    CXCR7     FLT1    VEGFA   VEGRF2  ACTB H4 GAPDH H5 RPS9 H11  TBP H26 
##       24       19        2       10        0        0       24        2       24       24       24       24

Let’s remove them:

lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]

Now we split genes to target and reference ones. Latter have H on the end of their names:

target_genes <- genes[str_detect(genes, ".+ H", negate = TRUE)]
ref_genes <- setdiff(genes, target_genes)

gene_groups <- if_else(str_detect(genes, " H\\d+$"), "reference", "target")
(genes_df <- data.frame(
  gene = as.character(genes),
  gene_group = factor(gene_groups, levels = c("reference", "target")),
  
  row.names = genes,
  stringsAsFactors = FALSE)
)

For plotting purposes, we replace NAs by 40:

cp_plot <- replace(cp, is.na(cp), 40)

Phenotypical data (sample sheet)

You usually obtain a sample sheet, but in this case we are going to parse the column names of \(C_P\) matrix. Everything needed is contained there:

  • K and SP: sample groups
  • Number behind sample group: Petri dish (= biological replicate)
  • Number behind slash (“/”): confluence
  • (2RT): second technical replicate
(sample_names <- colnames(cp))
##  [1] "K1/50"        "K1/50(2RT)"   "K1/90"        "K1/90 (2RT)"  "K2/50"        "K2/50(2RT)"   "K2/90"        "K2/90 (2RT)"  "K3/50"        "K3/50(2RT)"   "K3/90"       
## [12] "K3/90 (2RT)"  "SP1/50"       "SP1/50 (2RT)" "SP1/90"       "SP1/90 (2RT)" "SP2/50"       "SP2/50 (2RT)" "SP2/90"       "SP2/90 (2RT)" "SP3/50"       "SP3/50(2RT)" 
## [23] "SP3/90"       "SP3/90 (2RT)"
pheno_data <- data.frame(
  sample_id = sample_names,
  sample_group_code = str_match(sample_names, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
  petri_number = str_match(sample_names, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
  confluence = str_match(sample_names, "\\/(\\d{2})")[, 2] %>% as.factor(),
  replicate = str_match(sample_names, "\\((\\d)RT\\)")[, 2] %>% replace_na("1") %>% as.factor()
) %>%
  dplyr::mutate(
    sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
    sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
    sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
  ) %>%
  dplyr::select(sample_name_rep, sample_name, everything()) %>%
  set_rownames(.$sample_name_rep)

colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)

head(pheno_data)

Long data

We also create long data from cp, pheno_data and genes_df.

cp_long <- as.data.frame(cp) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
  dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
  dplyr::left_join(genes_df, by = "gene") %>%
  dplyr::mutate(
    cp_plot = if_else(is.na(cp), 40, cp)
  ) %>%
  dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())

head(cp_long)

Exploratory analysis on raw data

We are using exploratory analysis to have an unbiased first look at data. It is also a crucial step to assess the biological quality control, e.g. whether samples from different groups create separate clusters - if not, samples might be swapped.

Hiearchical clustering (TASK 1)

TASK 1: Implement your own function which takes an expression matrix and returns a dendrogram with coloring by a chosen variable.

color_variables <- c("sample_group", "confluence", "petri_number")

for (variable in color_variables) {
  plot_hc(
    cp_plot, color_by = pheno_data[, variable],
    method_distance = "euclidean", method_clustering = "complete",
    color_by_lab = variable
  )
}

PCA (TASK 2)

TASK 2: Implement a function for PCA visualization with point coloring by a chosen variable.

  • Base R:
plot_pca(cp_plot, color_by = pheno_data$sample_group, color_by_lab = "sample_group")

  • You can use ggplot2, but be prepared to use data in long format:
plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot

plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot

plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot

  • You can also use GGally::ggpairs() to plot a grid of PCA plots (PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3, etc):
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot

Heatmap (TASK 3)

TASK 3: Implement your own function for heatmap.

You can use a nice package heatmaply which produces interactive HTML heatmaps. Or pheatmap for a better alternative to heatmap.2().

plot_heatmap(as.matrix(cp_plot), color_by = pheno_data$sample_group, color_legend_lab = "Sample group", main = "Treatment (Cp values)")

plot_pheatmap(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR")

plot_heatmaply(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR", key.title = "CP")

Can you see any outlying samples (replicates)? If yes, should they be removed?

Normalization to reference genes

We will use housekeeping genes as reference genes. Housekeeping genes are genes which have similar expression profile regardless the cell type or state. That is why they can be used for sample normalization (as an internal control).

Selecting the reference genes

Correlation

Expression of housekeeping genes is/should be correlated.

main_title <- "Raw data correlation"

t(cp) %>%
  as.data.frame() %>%
  GGally::ggpairs(progress = FALSE) +
  theme_bw()

You can also use base R:

pairs(t(cp))

Let’s look more closely at correlation coefficients:

(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
##                 CD24        CD31       CXCR4       VEGFA     ACTB H4    GAPDH H5   RPS9 H11     TBP H26
## CD24      1.00000000  0.03970592 -0.28809768  0.73649418  0.73547823  0.73141005 0.76319246  0.72419041
## CD31      0.03970592  1.00000000 -0.73606175  0.03747391  0.03172092  0.17532289 0.10746231  0.08438151
## CXCR4    -0.28809768 -0.73606175  1.00000000 -0.20149851 -0.18053811 -0.02185175 0.08141219 -0.25461239
## VEGFA     0.73649418  0.03747391 -0.20149851  1.00000000  0.77468336  0.82725327 0.83395485  0.77534719
## ACTB H4   0.73547823  0.03172092 -0.18053811  0.77468336  1.00000000  0.92233124 0.89743784  0.93865311
## GAPDH H5  0.73141005  0.17532289 -0.02185175  0.82725327  0.92233124  1.00000000 0.96305770  0.93496580
## RPS9 H11  0.76319246  0.10746231  0.08141219  0.83395485  0.89743784  0.96305770 1.00000000  0.90135469
## TBP H26   0.72419041  0.08438151 -0.25461239  0.77534719  0.93865311  0.93496580 0.90135469  1.00000000
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
  ggtitle(main_title)

You can also use lattice:

lattice::levelplot(gene_cor, main = main_title)

Or heatmaply:

plot_heatmaply(gene_cor, feature_data = genes_df, row_color_by = "gene_group", main = main_title)

Housekeeping genes should cluster together. We can try both Euclidean and Pearson distance (1 - correlation coefficient) for hiearchical clustering.

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")

PCA of correlation coefficients is also helpful:

plot_pca_ggplot2(gene_cor, genes_df, color_by = "gene_group")$plot

geNorm M-values (TASK 4)

M values are telling you how stable is a gene expressed across the samples. From the original paper:

“This measure relies on the principle that the expression ratio of two ideal internal control genes is identical in all samples, regardless of the experimental condition or cell type. In this way, variation of the expression ratios of two real-life housekeeping genes reflects the fact that one (or both) of the genes is (are) not constantly expressed, with increasing variation in ratio corresponding to decreasing expression stability.”

For a further explanation see the section “Gene-stability measure and ranking of selected housekeeping genes” in the paper above.

TASK 4: Implement a function which computes the M value. Hints:

  • Remember that R is vectorized by default. In this way you can directly write most of equations as you see them on paper.
  • Very useful is apply() function. It applies a function to each of rows or columns of a matrix. Whether to apply function to rows or columns can be specified by MARGIN parameter: 1 for rows and 2 for columns. That is, when MARGIN = 1, you obtain N values from matrix with N rows. Similarly for MARGIN = 2, but for M columns.

Let’s compute the M values:

for (g in ref_genes)
  compute_m(g, cp_plot[ref_genes, ]) %>% glue("{g}: ", format(., digits = 3)) %>% cat(sep = "\n")
## 0.587524037202414ACTB H4: 0.588
## 0.675998845667576GAPDH H5: 0.676
## 0.73511830732382RPS9 H11: 0.735
## 0.545368225741302TBP H26: 0.545

All reference genes seem to be OK. But compare them with target genes - they would not be good reference genes:

for (g in target_genes)
  compute_m(g, cp_plot[target_genes, ]) %>% glue("{g}: ", format(., digits = 3)) %>% cat(sep = "\n")
## 1.62202887577501CD24: 1.62
## 2.50323654673845CD31: 2.5
## 2.18054789182113CXCR4: 2.18
## 1.59255501294409VEGFA: 1.59

Subtracting the reference genes: \(\Delta C_P\)

For each sample we calculate a geometric mean of reference genes. Don’t be confused, it’s just this formula applied to reference genes of each sample.

norm <- apply(cp[ref_genes, ], 2, function(x) {
  log(x) %>% mean() %>% exp()
})

head(norm)
## c1_50_r1 c1_50_r2 c1_90_r1 c1_90_r2 c2_50_r1 c2_50_r2 
## 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331

Let’s look how these geometric means of reference genes behave. As you can see, they lie in the centre of reference genes, and that’s what we want.

gene_cor_norm <- rbind(cp, norm = norm) %>%
  t() %>%
  cor(use = "pairwise.complete.obs")

genes_df_norm <- genes_df %>%
  dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
  rbind(norm = c("norm", "norm"))

ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
  ggtitle(main_title)

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")

plot_pca_ggplot2(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot

Now we calculate \(\Delta C_P\) values by subtracting the reference gene means from target genes. First, we create a matrix of the same size as cp, with reference gene mean of each sample in columns.

ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE)
ref_gene_cp_matrix[1:5, 1:5]
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [2,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [3,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [4,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [5,] 23.69656 23.90002 24.05644 24.08679 23.30099

Now we are ready for subtraction, that is, to calculate \(\Delta C_t = C_P^t - C_P^r\):

delta_cp <- cp - ref_gene_cp_matrix
delta_cp[1:5, 1:5]

Now we average the technical replicates. First, we create an empty matrix, with the same number of rows as delta_cp matrix, but with the same number of columns (and column names) as the number of biological replicates (next time I will call them just “samples”).

samples <- unique(pheno_data$sample_name)

delta_cp_avg <- matrix(NA, nrow = nrow(delta_cp), ncol = length(samples), dimnames = list(rownames(delta_cp), samples))
delta_cp_avg[1:5, 1:5]
##         c1_50 c1_90 c2_50 c2_90 c3_50
## CD24       NA    NA    NA    NA    NA
## CD31       NA    NA    NA    NA    NA
## CXCR4      NA    NA    NA    NA    NA
## VEGFA      NA    NA    NA    NA    NA
## ACTB H4    NA    NA    NA    NA    NA

We create a list in which names are samples and values are names of their corresponding technical replicates:

tech_rep <- split(colnames(delta_cp), pheno_data$sample_name)
head(tech_rep)
## $c1_50
## [1] "c1_50_r1" "c1_50_r2"
## 
## $c1_90
## [1] "c1_90_r1" "c1_90_r2"
## 
## $c2_50
## [1] "c2_50_r1" "c2_50_r2"
## 
## $c2_90
## [1] "c2_90_r1" "c2_90_r2"
## 
## $c3_50
## [1] "c3_50_r1" "c3_50_r2"
## 
## $c3_90
## [1] "c3_90_r1" "c3_90_r2"

And finally we start to fill in averages of technical replicates to our empty matrix delta_cp_avg. For each of the samples, we select columns from delta_cp matrix, corresponding to technical replicates of that sample, and calculate the mean of each row (= each gene). We add these means to sample’s column in delta_cp_avg.

for (s in samples)
  delta_cp_avg[, s] <- rowMeans(delta_cp[, tech_rep[[s]], drop = FALSE], na.rm = TRUE)

# For control.
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
  all() %>%
  stopifnot()

Let’s finalize the \(\Delta C_P\) calculation. We keep only target genes and again, create a separate matrix for plotting purposes. We also add \(\Delta C_P\) to our long data.

delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))

# Remove replicates.
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything()) %>%
  set_rownames(.$sample_name)

cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything())

# Add delta_cp_avg
cp_long <- as.data.frame(delta_cp_avg) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

# Add delta_cp_avg_plot
cp_long <- as.data.frame(delta_cp_avg_plot) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long_target <- dplyr::filter(cp_long, gene_group == "target")

Exploratory analysis on summarised and normalised data

You can check for outliers, as these could mess up a differential expression analysis.

plot_pca_ggplot2(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence")$plot

for (variable in color_variables) {
  plot_hc(
    delta_cp_avg_plot,
    color_by = pheno_data_avg[, variable],
    method_distance = "euclidean",
    method_clustering = "complete",
    color_by_lab = variable
  )
}

plot_pheatmap(delta_cp_avg_plot, sample_data = pheno_data_avg, column_color_by = color_variables, main = "qPCR")


Calculating (\(log_2\)) fold-changes

Now we calculate the relative change in gene expression between sample groups. For the sake of simplicity, we don’t take into account other variables, e.g. confluence. Also from the exploratory analysis you can see that these variables don’t have much effect on the separation of samples - sample group has the largest influence.

First, we calculate the \(mean(C_P)\) of control samples:

(control_means <- cp_long_target %>%
  dplyr::filter(sample_group == "control") %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)

Then we join this table with our long data and calculate:

  • \(\Delta \Delta C_P = \Delta C_{P, treatment} - \text{mean}(\Delta C_P, control)\)
  • \(log_2(\text{fold-change}) = -\Delta \Delta C_P\)
  • \(\text{fold-change} = 2^{LFC}\)
cp_long_target <- cp_long_target %>%
  dplyr::left_join(control_means, by = "gene") %>%
  dplyr::mutate(
    delta_delta_cp = delta_cp - delta_cp_control_mean,
    lfc = -delta_delta_cp,
    fc = 2^lfc
  ) %>%
  dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())

head(cp_long_target)

\(\text{mean}(\Delta \Delta C_{P, control})\) is now \(0\), because we subtracted \(\text{mean}(\Delta C_P, control)\) also from the control samples. Of course this also applies for \(LFC\), which is just \(-\Delta \Delta C_P\). As you will see, this is useful for plotting (boxplots).

Boxplots (TASK 5)

Boxplots are very informative when you want to see a difference between groups.

TASK 5: Implement a function to produce a boxplot of values. It should work for any type of long data, as you will be able to specify the grouping column to split boxplots on x-axe, and column with values to plot on y-axe.

First we will plot the \(\Delta C_P\) values:

plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "delta_cp",
    feature_col = "gene",
    color_by = "sample_group",
    main = "delta Cp values",
    y_lab = "delta Cp"
)

And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated! Also, you cannot directly see the \(LFCs\) - in your head, you have to calculate the difference between spirulina and control and flip the sign.

So better is to plot the \(LFCs\). Why not the \(\text{fold-changes}\)? Because \(LFCs\) are easier to read, as changes (relative to controls) are multiplicative and symmetric:

\(\text{fold-change}\) 0.125 0.25 0.5 1 2 4 8
\(log_2(\text{fold-change})\) -3 -2 -1 0 1 2 3
You read: Change is… 8x less 4x less 2x less no change 2x more 4x more 8x more

This figure summarises how \(LFCs\) were calculated and why it is better for boxplots :)

plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    feature_col = "gene",
    color_by = "sample_group",
    main = "log2 fold-changes",
    y_lab = "log2 fold-change"
)

We can also split plots by confluence:

for (g in target_genes) {
  p1 <- plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    feature_col = "gene",
    feature = g,
    main = "log2 fold-changes",
    color_by = "sample_group",
    y_lab = "log2 fold-change"
  )
  
  p2 <- plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    feature_col = "gene",
    feature = g,
    main = "log2 fold-changes",
    color_by = "sample_group",
    facet_by = "confluence",
    y_lab = "log2 fold-change"
  )
  
  p1_p2 <- cowplot::plot_grid(p1, p2, ncol = 2)
  
  print(p1_p2)
}

t-test (TASK 6)

t-test is testing whether two means, coming from samples having Student’s distribution, significantly differ. t-test assumes that the input \(C_P\) values are normally distributed and the variance between conditions is comparable. Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.

As we will use two-sided alternative (i.e. gene is up- or downregulated = deregulated), it doesn’t matter which one of \(\Delta C_P\), \(\Delta \Delta C_P\) or \(LFC\) we will use. In all cases, the absolute difference between group means remains the same:

group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
  dplyr::group_by(sample_group) %>%
  dplyr::summarise(
    delta_cp_mean = mean(delta_cp),
    delta_delta_cp_mean = mean(delta_delta_cp),
    lfc_mean = mean(lfc)
  ) %>%
  as.data.frame() %>%
  tibble::column_to_rownames("sample_group")

group_means
apply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))
##       delta_cp_mean delta_delta_cp_mean            lfc_mean 
##            1.057847            1.057847            1.057847

TASK 6: Implement a function which do the t-test of gene from two groups. Hints:

  • Look at the formula parameter of t.test function.
for (g in target_genes) {
  test_results <- test_gene(
    gene = g,
    gene_data = cp_long_target,
    gene_col = "gene",
    value_col = "lfc",
    group_col = "sample_group",
    test = t.test,
    verbose = TRUE
  )
  cat("\n")
}
## CD24: spirulina vs. control
## t.test p-value: 0.000959 ***
## 
## CD31: spirulina vs. control
## t.test p-value: 0.00535 **
## 
## CXCR4: spirulina vs. control
## t.test p-value: 0.168 NS
## 
## VEGFA: spirulina vs. control
## t.test p-value: 0.00234 **

Or you can return a table for all genes:

(test_table <- test_gene_table(
  gene_data = cp_long_target,
  gene_col = "gene",
  value_col = "delta_cp",
  group_col = "sample_group"
))

Cleanup

save.image("qPCR.RData")

warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.7.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] friendlyeval_0.1.1 gplots_3.0.3       ggplot2_3.3.0      rlist_0.4.6.1      RColorBrewer_1.1-2 magrittr_1.5       tidyr_1.0.2        stringr_1.4.0      dplyr_0.8.4       
## [10] glue_1.3.1         knitr_1.28         rmarkdown_2.1     
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1      httr_1.4.1         jsonlite_1.6.1     viridisLite_0.3.0  foreach_1.4.8      gtools_3.8.1       shiny_1.4.0        assertthat_0.2.1   yaml_2.2.1        
## [10] ggrepel_0.8.1      pillar_1.4.3       digest_0.6.25      ggsignif_0.6.0     promises_1.1.0     colorspace_1.4-1   cowplot_1.0.0      htmltools_0.4.0    httpuv_1.5.2      
## [19] plyr_1.8.6         pkgconfig_2.0.3    ggcorrplot_0.1.3   pheatmap_1.0.12    xtable_1.8-4       purrr_0.3.3        scales_1.1.0       webshot_0.5.2      gdata_2.18.0      
## [28] later_1.0.0        tibble_2.1.3       farver_2.0.3       ggpubr_0.2.5       ellipsis_0.3.0     withr_2.1.2        lazyeval_0.2.2     mime_0.9           crayon_1.3.4      
## [37] heatmaply_1.0.0    evaluate_0.14      GGally_1.4.0       MASS_7.3-51.5      ggthemes_4.2.0     tools_3.6.3        registry_0.5-1     data.table_1.12.8  lifecycle_0.2.0   
## [46] plotly_4.9.2       munsell_0.5.0      cluster_2.1.0      packrat_0.5.0      compiler_3.6.3     caTools_1.18.0     rlang_0.4.5        grid_3.6.3         iterators_1.0.12  
## [55] htmlwidgets_1.5.1  crosstalk_1.0.0    bitops_1.0-6       labeling_0.3       gtable_0.3.0       codetools_0.2-16   reshape_0.8.8      TSP_1.1-9          reshape2_1.4.3    
## [64] R6_2.4.1           seriation_1.2-8    gridExtra_2.3      fastmap_1.0.1      KernSmooth_2.23-16 dendextend_1.13.4  stringi_1.4.6      Rcpp_1.0.3         vctrs_0.2.3       
## [73] gclus_1.3.2        tidyselect_1.0.0   xfun_0.12

HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell :)

library(rmarkdown)
library(knitr)
library(glue)

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("qPCR.Rmd", output_file = "qPCR.html")